Automatically describing video content with natural language is a fundamental challenge of multimedia. Recurrent Neural Networks (RNNs), which model sequence dynamics, have attracted increasing attention for visual interpretation. However, most existing approaches generate a word locally given the previous words and the visual content, while the relationship between the semantics of the entire sentence and the visual content is not holistically exploited. As a result, the generated sentences may be contextually correct, but their semantics (e.g., subjects, verbs or objects) may be untrue.

This paper presents a novel unified framework, named Long Short-Term Memory with visual-semantic Embedding (LSTM-E), which simultaneously explores the learning of LSTM and visual-semantic embedding. The former aims to locally maximize the probability of generating the next word given the previous words and the visual content, while the latter creates a visual-semantic embedding space that enforces the relationship between the semantics of the entire sentence and the visual content. Our proposed LSTM-E consists of three components: 2-D and/or 3-D deep convolutional neural networks (CNNs) for learning a powerful video representation, a deep RNN for generating sentences, and a joint embedding model for exploring the relationships between visual content and sentence semantics. Experiments on the YouTube2Text dataset show that our proposed LSTM-E achieves the best reported performance to date in generating natural sentences: 45.3% and 31.0% in terms of BLEU@4 and METEOR, respectively. We also demonstrate that LSTM-E is superior to several state-of-the-art techniques in predicting Subject-Verb-Object (SVO) triplets.
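The joint objective described above can be illustrated with a minimal PyTorch-style sketch. This is not the authors' implementation: the class name LSTME, the mean-pooled sentence representation, all layer sizes, and the trade-off weight lam are assumptions for illustration, and the exact form of the relevance and coherence terms in the paper may differ.

```python
import torch
import torch.nn as nn

class LSTME(nn.Module):
    """Illustrative sketch of the LSTM-E joint objective (assumptions noted above).

    Takes pre-extracted 2-D/3-D CNN video features; combines a relevance loss
    in a joint visual-semantic embedding space with a coherence loss from an
    LSTM language model.
    """
    def __init__(self, video_dim, vocab_size, embed_dim=512, hidden_dim=512):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, embed_dim)  # maps video features into the joint space
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        self.sent_proj = nn.Linear(embed_dim, embed_dim)   # maps a sentence vector into the joint space
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)
        self.ce = nn.CrossEntropyLoss()

    def forward(self, video_feat, captions):
        # video_feat: (batch, video_dim) pre-extracted CNN features
        # captions:   (batch, T) word indices of the ground-truth sentence
        v = self.video_proj(video_feat)                    # joint-space video vector
        words = self.word_embed(captions)                  # (batch, T, embed_dim)
        s = self.sent_proj(words.mean(dim=1))              # sentence vector: mean pooling (assumed)
        # Relevance loss: pull the whole sentence and the video together
        # in the shared embedding space.
        relevance = ((v - s) ** 2).sum(dim=1).mean()
        # Coherence loss: next-word prediction by the LSTM, with the
        # projected video vector fed as the first input step.
        inputs = torch.cat([v.unsqueeze(1), words[:, :-1]], dim=1)
        logits, _ = self.lstm(inputs)
        logits = self.out(logits)                          # (batch, T, vocab_size)
        coherence = self.ce(logits.reshape(-1, logits.size(-1)), captions.reshape(-1))
        lam = 0.7                                          # trade-off weight (assumed value)
        return (1 - lam) * relevance + lam * coherence

# Usage sketch with random data (dimensions are hypothetical):
model = LSTME(video_dim=4096, vocab_size=10000)
loss = model(torch.randn(8, 4096), torch.randint(0, 10000, (8, 12)))
loss.backward()
```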